Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gix-protocol::fetch() without Delegates #1634

Merged
merged 7 commits into from
Dec 16, 2024
Merged

gix-protocol::fetch() without Delegates #1634

merged 7 commits into from
Dec 16, 2024

Conversation

Byron
Copy link
Member

@Byron Byron commented Oct 18, 2024

Having learned from gix it should now be possible to have a command-oriented fetch
that is similar to gix_protocol::ls_refs().

Motivated by #1319, as these changes should allow heartwood to use most of gix-protocol
without duplicating a lot of code.

Tasks

  • make current fetch() test-only
  • RefMap transplanted
    • Connection uses utility transport type to avoid drop impl.
    • negotiation
    • pack reception
  • let gitoxide-core use new protocol facilities.
  • re-implement the test-only fetch with the command-oriented fetch, maybe provide it
    as utility in gix-protocol even? That would allow the Delegates to remain, but
    be specific to this particularly 'simple' way of performing a fetch.
    • Actually, the old tests are fine and profit from the old implementation. Not worth changing this now, especially since the gix tests are more exhaustive and real.

Research

  • loose types
    • RefMap
    • Connection
      • transport
      • refmap
      • all kinds of refspecs
      • object_hash (only for writing a pack, probably good to keep this outside), but also needed for pre-check
  • Steps
    • ls-refs generates RefMap
    • negotiate - needs a lot of state, particularly references, traversal, peeling
    • receive pack
    • update refs

@Byron Byron mentioned this pull request Oct 18, 2024
2 tasks
@Byron Byron force-pushed the remove-delegates branch 2 times, most recently from b5071ed to 437be4f Compare December 9, 2024 19:13
This is a first step towards creating a multi-step `fetch()` which can
also be used in the `gix` crate.
@Byron Byron force-pushed the remove-delegates branch 2 times, most recently from 842b0de to 42bc48e Compare December 11, 2024 16:23
…n error.

This makes the system more reliable, particularly when certain operations are
performed on Drop, and can happen prematurely.
@Byron Byron force-pushed the remove-delegates branch 2 times, most recently from 203e91e to 6e506ca Compare December 14, 2024 18:31
That way it's easier to reuse shallow-handling code from plumbing crates.

Note that this is a breaking change as `gix-protocol` now uses the `gix-shallow::Update`
type, which doesn't implement a formerly public `from_line()` method anymore.
Now it is available as `fetch::response::shallow_update_from_line()`.
@Byron Byron force-pushed the remove-delegates branch 4 times, most recently from 29950a7 to e9e1553 Compare December 16, 2024 17:38
For now, just move the code down and immediately re-integrate in `gix`
to be able to use its tests to validate it.

This is a breaking change as some types move and change the layout.
@Byron Byron marked this pull request as ready for review December 16, 2024 17:41
@Byron Byron force-pushed the remove-delegates branch 3 times, most recently from 00d866f to 5f11cc7 Compare December 16, 2024 18:56
@Byron Byron enabled auto-merge December 16, 2024 18:56
@@ -107,7 +107,9 @@ title "gix (with repository)"
)
fi

(with "git:// protocol"
# for some reason, on CI the daemon always shuts down before we can connect,
# or isn't actually ready despite having accepted the first connection already.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EliahKagan Just to share, this is an oddity that might just be coincidence, or it is indeed very subtly related to the code change here.

What happens is that on CI it can never connect to the Git daemon that it just started. The script only proceeds once the first connection was established using nc. For wall I can tell, the server is down and doesn't accept connections at all, when a connection is attempted. The code doing that didn't even change. Only the shutdown of the connection (with EOF packet being sent) happens on Drop instead of explicitly. But for all I can tell, it doesn't even get there. And the first connection attempt already fails, so it's not that a previous connection somehow messed up the daemon.

It's very strange.

Locally, it would sometimes reproduce, but also sometimes it could work for 10 times in a row.
It seems the Git daemon doesn't work on CI at all anymore.

With debugging, the daemon prints this on success:

it-daemon
19:45:49.427373 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[86841] Connection from [::1]:64256
fatal: the remote end hung up unexpectedly
[86839] [86841] Disconnected (with error)
19:45:49.448501 run-command.c:655       trace: run_command: REMOTE_ADDR='[::1]' REMOTE_PORT=64257 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --serve --verbose --base-path=. --export-all --user-path
19:45:49.451073 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
19:45:49.451788 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[86853] Connection from [::1]:64257
19:45:49.451957 pkt-line.c:80           packet:          git< git-upload-pack /\0host=localhost\0
[86853] Extended attribute "host": localhost
[86853] Request upload-pack for '/'
19:45:49.452171 run-command.c:655       trace: run_command: git upload-pack --strict --timeout=0 .
[86853] 19:45:49.454781 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git
[86853] 19:45:49.455688 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[86853] 19:45:49.455955 git.c:460               trace: built-in: git upload-pack --strict --timeout=0 .
[86853] 19:45:49.456947 pkt-line.c:80           packet:  upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed symref=HEAD:refs/heads/main object-format=sha1 agent=git/2.39.5.(Apple.Git-154)
[86853] 19:45:49.458105 pkt-line.c:80           packet:  upload-pack> ee3c97678e89db4eab7420b04aef51758359f152 refs/heads/dev
[86853] 19:45:49.458150 pkt-line.c:80           packet:  upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 refs/heads/main
[86853] 19:45:49.458153 pkt-line.c:80           packet:  upload-pack> feae03400632392a7f38e5b2775f98a439f5eaf5 refs/tags/annotated
[86853] 19:45:49.458245 pkt-line.c:80           packet:  upload-pack> ee3c97678e89db4eab7420b04aef51758359f152 refs/tags/annotated^{}
[86853] 19:45:49.458248 pkt-line.c:80           packet:  upload-pack> efa596d621559707b2d221f10490959b2decbc6c refs/tags/unannotated
[86853] 19:45:49.458342 pkt-line.c:80           packet:  upload-pack> 0000
[86853] 19:45:49.459051 pkt-line.c:80           packet:  upload-pack< 0000
[86853] 19:45:49.459055 trace.c:411             performance: 0.003111000 s: git command: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git upload-pack --strict --timeout=0 .
[86839] [86853] Disconnected

           [with] version 2
              [it] generates the correct output19:45:49.483351 run-command.c:655       trace: run_command: REMOTE_ADDR='[::1]' REMOTE_PORT=64258 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --serve --verbose --base-path=. --export-all --user-path
19:45:49.485905 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
19:45:49.486612 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[86868] Connection from [::1]:64258
19:45:49.486782 pkt-line.c:80           packet:          git< git-upload-pack /\0host=localhost\0\0version=2\0
[86868] Extended attribute "host": localhost
[86868] Extended attribute "protocol": version=2
[86868] Request upload-pack for '/'
19:45:49.487010 run-command.c:655       trace: run_command: GIT_PROTOCOL=version=2 git upload-pack --strict --timeout=0 .
[86868] 19:45:49.489777 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git
[86868] 19:45:49.490572 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[86868] 19:45:49.490861 git.c:460               trace: built-in: git upload-pack --strict --timeout=0 .
[86868] 19:45:49.490985 pkt-line.c:80           packet:  upload-pack> version 2
[86868] 19:45:49.491008 pkt-line.c:80           packet:  upload-pack> agent=git/2.39.5.(Apple.Git-154)
[86868] 19:45:49.491234 pkt-line.c:80           packet:  upload-pack> ls-refs=unborn
[86868] 19:45:49.491257 pkt-line.c:80           packet:  upload-pack> fetch=shallow wait-for-done ref-in-want
[86868] 19:45:49.491260 pkt-line.c:80           packet:  upload-pack> server-option
[86868] 19:45:49.491263 pkt-line.c:80           packet:  upload-pack> object-format=sha1
[86868] 19:45:49.491265 pkt-line.c:80           packet:  upload-pack> object-info
[86868] 19:45:49.491267 pkt-line.c:80           packet:  upload-pack> 0000
[86868] 19:45:49.491621 pkt-line.c:80           packet:  upload-pack< command=ls-refs
[86868] 19:45:49.491625 pkt-line.c:80           packet:  upload-pack< agent=git/oxide-0.68.0
[86868] 19:45:49.491626 pkt-line.c:80           packet:  upload-pack< 0001
[86868] 19:45:49.491628 pkt-line.c:80           packet:  upload-pack< symrefs
[86868] 19:45:49.491633 pkt-line.c:80           packet:  upload-pack< peel
[86868] 19:45:49.491634 pkt-line.c:80           packet:  upload-pack< unborn
[86868] 19:45:49.491636 pkt-line.c:80           packet:  upload-pack< 0000
[86868] 19:45:49.492175 pkt-line.c:80           packet:  upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 HEAD symref-target:refs/heads/main
[86868] 19:45:49.492607 pkt-line.c:80           packet:  upload-pack> ee3c97678e89db4eab7420b04aef51758359f152 refs/heads/dev
[86868] 19:45:49.492610 pkt-line.c:80           packet:  upload-pack> 3f72b39ad1600e6dac63430c15e0d875e9d3f9d6 refs/heads/main
[86868] 19:45:49.492711 pkt-line.c:80           packet:  upload-pack> feae03400632392a7f38e5b2775f98a439f5eaf5 refs/tags/annotated peeled:ee3c97678e89db4eab7420b04aef51758359f152
[86868] 19:45:49.492760 pkt-line.c:80           packet:  upload-pack> efa596d621559707b2d221f10490959b2decbc6c refs/tags/unannotated
[86868] 19:45:49.492763 pkt-line.c:80           packet:  upload-pack> 0000
[86868] 19:45:49.492979 pkt-line.c:80           packet:  upload-pack< 0000
[86868] 19:45:49.493028 trace.c:411             performance: 0.002177000 s: git command: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git upload-pack --strict --timeout=0 .
[86839] [86868] Disconnected

I diffed that output (cutting away the first three columns) in comparison with a version from main, and it was the same (only ports were different).

On failure, there was no output that would say anything - no connection was established except for the first one to test if the server is up.

And now there is the next failure, this time with code that fetches a pack.

Hmm… for now I feel forced to disable these tests, but can't shake the feeling that this is telling something 😅

Copy link
Member

@EliahKagan EliahKagan Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it fails, could one thing be using IPV4 127.0.0.1 and another thing be using IPv6 ::1?

This seems like a long shot and is somewhat conceptually unformed ("one/another thing"). But I may need to go away from my computer for a while, so I figured I'd mention my one idea so far, in case somehow it turns out to be relevant and could keep the test from having to be turned off on CI.

Copy link
Member Author

@Byron Byron Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the error message I think it is clear that the Git daemon socket isn't open anymore, and that before the first 'real' connection. Maybe there is merit in getting details on the error code at least. stdout/stderr would be visible otherwise, so it would be a natural exit maybe.

Very strange still, also locally it's pretty stable, albeit not perfect, but on CI the failure rate is basically 100% now.

I will forcefully push past this for now, and hope it's truly something related to the Git daemon itself. Maybe it's just timing and main (pre-merge) would also be broken now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the error message I think it is clear that the Git daemon socket isn't open anymore, and that before the first 'real' connection.

Ah, okay.

Very strange still, also locally it's pretty stable, albeit not perfect, but on CI the failure rate is basically 100% now.

I will forcefully push past this for now, and hope it's truly something related to the Git daemon itself. Maybe it's just timing and main (pre-merge) would also be broken now?

I reran CI in my fork at faa0cde, the tip of main before this PR was merged. All tests passed.

On CI, the journey tests run only on Ubuntu, because they are run from the CI test job, which only runs on ubuntu-latest. Is it possible that what makes a difference is whether they are run on macOS versus GNU/Linux? (My guess is that you are most often running tests on macOS when you run them locally, and also the output you showed seems to have been from a macOS system.) Could they be usually but not always passing on macOS, while rarely passing on GNU/Linux?

Another difference, in comparing ubuntu-latest CI runs to local macOS runs, is in versions of Git, since ubuntu-latest usually has the latest stable version of Git that was available when the most recent production ubuntu-latest GHA image was built (which is usually recently).

These things, at least the question of whether local GNU/Linux runs succeed, seems like the kind of thing I could just check... and I suspect I will be able to find out, but I am not certain, because I'm not clear on how, exactly, you are running the tests locally. In the CI test output you showed, the displayed failure occurred in tests of the max feature and not of others. But I don't know if that's something that usually happens, or if it's just nondeterministic and happened to happen that time. Furthermore, if it does usually happen with max and not the others that are tested earlier, I don't know if that's because of a difference in the features themselves, or if it's just that running the tests multiple times in succession in the same environment causes them to fail due to some kind of effect of the previous runs.

Copy link
Member

@EliahKagan EliahKagan Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into this further and found some useful things, though not a full solution.

A simple, but strange, workaround

I tried reenabling the affected journey tests on CI on a feature branch, to verify that they fail there, which they do. On a separate throwaway test branch, I made the test-journey job into a matrix job and ran it not just on ubuntu-latest but also on macos-latest. I can get the job to fail on macos-latest on rare occasion, but it almost always succeeds, while in contrast it always fails on ubuntu-latest. (I've verified that ubuntu-latest is picking an Ubuntu 24.04 LTS runner in all these tests, including the one you linked above, so that's not the issue.) This seems to accord with the distinction you've reported.

On the feature branch with the affected journey tests enabled, I then tried removing the rust-cache step. That fixed it! I don't really know why this is, though I have some ideas, and it is probably a workaround rather than a real fix. Nonetheless, I ran the tests on CI more than 15 times, and they always passed. Although I expect to be able to reach a real fix or at least a better workaround, for now this is simple and works, and I may not have a chance to work on this again for a day or two, so I plan to open a PR [edit: #1725] just with the change of turning off caching. Since the journey tests run in their own job, turning off caching doesn't slow down any non-journey tests. With caching turned off, the test-journey job usually takes between 6:00 and 6:30 to run, which I think is probably acceptable.

My experiments so far: idea 1

There are two reasons turning off caching may have helped. One is if artifacts from the cache are affecting the behavior. That shouldn't happen, though, since any stale versions of dependencies should automatically not be used. The other reason, though, is timing: without caching, there are longer delays, due to more compilation, not just before any journey tests can run, but between the different runs of the journey tests that are done using different feature sets. Perhaps whatever condition triggers the problem corrects itself, given sufficient time between runs.

Returning to my initial idea about ::1 vs. 127.0.0.1, in that test run (linked there), as well as in all of the various test runs I've done since that have reproduced the problem, nc reports it has connected to ::1 for the several tests using git-daemon that succeed, and reports it has connected to 127.0.0.1 for the test using git-daemon that fails.

This suggests that, in the failing test, git-daemon can't bind ::1 and binds 127.0.0.1 instead, or, alternatively, that for some reason, it decides to bind 127.0.0.1 and not ::1, while gix still uses ::1. I don't know why this would happen, since I would think that it should be binding both if they both exist, unless directed otherwise.

Such a situation could happen if a previous run of git-daemon still has the port, and that's what most signs point to, even separately from the address disparity: the nc command checks that it can connect, but it does not check that it is connecting to the instance of git-daemon that was just run. Furthermore, there are two reasons to suspect that a previous git-daemon instance could still be running:

  • The git process (started by the git -c ... daemon ... command) actually forks before it execs git-daemon from its git-core directory and both processes are typically running. (This can be observed with pstree.) I've verified that this happens on both GNU/Linux and macOS. $! is the PID of the git parent, not the git-daemon child. If the EXIT trap either fails to run or the SIGTERM signal its kill command sends to the parent git process does not result in the git-daemon child process terminating, and if nothing else terminates it, then it will still be running.
  • set -e will not catch a failure from an asynchronous ... & command that is able to create an asynchronous process, since its status will be successful even if the asynchronous process immediately fails.

However...that doesn't seem to be happening. If I instrument the code with a pre-check using nc that forces failure if the port is already open, that failure does not occur: the port is closed before each test runs git daemon. This holds up even if both IPv4 and IPv6 localhost are checked with separate nc commands.

And yet...

My experiments so far: idea 2

...it cannot be a coincidence that nc always reports connecting to ::1 if gix is about to succeed, and always reports connecting to 127.0.01 if gix is about to fail. What is gix connecting to?

It seems, for some reason, that it is connecting to ::1. I don't know if this is due to change in the behavior of gix, or if it is instead brought about by timing. This could still even be entirely due the journey test shell scripts and git behavior, since what it connects to is at least in part related to what interfaces are being listened on. I am not sure.

The reason I think--or, I should say, thought--gix is treating localhost to mean ::1, even when nc has treated it to mean 127.0.0.1, is that it looks like the problem went away when I added --listen=::1 to the git daemon command and modified the nc -z loop that follows it to connect to ::1. But I am not sure that was a correct observation, because when I do this now, it is taking forever, even on tests that had worked before with ::1: somehow, even where nc had reported connecting to ::1, it cannot do so when passed ::1.

Anyway, even if further tweaks along these lines--which I do plan to attempt--does make it work, there will be two problems:

  • Viewing things narrowly, it would be nice to keep the journey tests broadly runnable locally, and not everybody has both IPv4 and IPv6 set up, even on the loopback interface. Also, though maybe this less significant, I have found that not all implementations of nc accept -6 or even recognize ::1: when I install gnu-netcat on an Arch Linux system (that does have IPv6 and I can ping ::1), neither works.
  • Viewing things broadly, the problem is that both the invocation of git daemon and the access to it by gix should, if possible, work on all systems that have at least one of IPv4 and IPv6. If that is not feasible, then the reason, and whatever limitations it imposes, should be documented.

In addition, while it seems to me that these various signs of the IPv4/IPv6 distinction mattering are significant, it might still turn out that this general idea of mine is mistaken--with some other explanation for the coincidence--and that this neither causes nor can be used to fix the problem.

Current status and branch links

There are three relevant branches in my fork. The least interesting is run-ci/macos, where I experimented with running the journey tests on CI on macOS as well as Ubuntu. The other two are interesting. Of them, run-ci/daemon is the most useful because it contains the removal of caching that works around the problem, while run-ci/daemon-next consists of it with more commits trying various things: the things described above, and a bit more. (39a1cc7 is the first commit of run-ci/daemon-next not on run-ci/daemon.)

Of these, only run-ci/daemon would be suitable for merging at this time, so I'll only open a PR for it. [Edit: As noted above, I've opened #1725 for this.] These are the repeated test runs that pass with that change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know, thank you! This means that this PR does contains a change that causes the failures.

On MacOS, I run j journey-tests with git version 2.39.5 (Apple Git-154). Indeed, if the errors show they only show with the max feature. It's also true that the failures only happen with the max future, but always work with the max-pure and async versions of the binaries.

max-pure definitely is the same when it comes to setting up blocking TCP connections to the local server compared to max.

Also known is that in the failing max run of the journey tests, the git-daemon binary comes up and accepts its first (fake) connection, but is down once gix tries to connect.

After running it 3 times successfully, the fourth time I saw this:

[15435] 07:25:25.095601 pkt-line.c:80           packet:  upload-pack> efa596d621559707b2d221f10490959b2decbc6c refs/tags/unannotated
[15435] 07:25:25.095603 pkt-line.c:80           packet:  upload-pack> 0000
[15435] 07:25:25.095791 pkt-line.c:80           packet:  upload-pack< 0000
[15435] 07:25:25.095794 trace.c:411             performance: 0.001274000 s: git command: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git upload-pack --strict --timeout=0 .
[15209] [15435] Disconnected

/Users/byron/dev/github.com/GitoxideLabs/gitoxide/tests/journey/gix.sh: line 160: kill: (15405) - No such process
error: Recipe `journey-tests` failed on line 204 with exit code 1

gitoxide ( main) +11 -2 [$!] took 3s
❯ ps aux | grep git
byron            15467   0.0  0.0 410059184    192 s002  R+    7:25AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox --exclude-dir=.venv --exclude-dir=venv git
byron            15209   0.0  0.0 410754192   4288 s002  S     7:25AM   0:00.00 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --verbose --base-path=. --export-all --user-path
byron            15206   0.0  0.0 410736816   4224 s002  S     7:25AM   0:00.01 /Applications/Xcode.app/Contents/Developer/usr/bin/git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path

gitoxide ( main) +11 -2 [$!]

What happened is that an upload-pack child sub-process was 'stuck' in the prior run, so the current run was unable to bring up a new git daemon (line 160: kill: (15405) - No such process.

Upon running the journey test with this patch applied again and letting it finish…

diff --git a/tests/helpers.sh b/tests/helpers.sh
index 2f435a553..0b45e34b5 100644
--- a/tests/helpers.sh
+++ b/tests/helpers.sh
@@ -60,7 +60,16 @@ function small-repo-in-sandbox() {
 }
 
 function launch-git-daemon() {
-    git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path &>/dev/null &
+    GIT_TRACE=true \
+    GIT_TRACE_PACK_ACCESS=true \
+    GIT_TRACE_PACKET=true \
+    GIT_TRACE_PACKFILE=true \
+    GIT_TRACE_PERFORMANCE=true \
+    GIT_TRACE_SHALLOW=true \
+    GIT_TRACE_SETUP=true \
+    GIT_CURL_VERBOSE=true \
+    GIT_SSH_COMMAND="ssh -VVV" \
+    git -c uploadpack.allowrefinwant daemon --verbose --base-path=. --export-all --user-path  &
     daemon_pid=$!
     while ! nc -z localhost 9418; do
       sleep 0.1
diff --git a/tests/journey.sh b/tests/journey.sh
index 63771e415..c1a110681 100755
--- a/tests/journey.sh
+++ b/tests/journey.sh
@@ -36,5 +36,5 @@ export LC_ALL=C
 
 set-static-git-environment
 
-source "$root/journey/ein.sh"
+#source "$root/journey/ein.sh"
 source "$root/journey/gix.sh"

… I saw this:

              [it] generates the correct output in JSON format
07:31:51.918740 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/bin/git
07:31:51.919058 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/bin
07:31:51.919414 git.c:750               trace: exec: git-daemon --verbose --base-path=. --export-all --user-path
07:31:51.919423 run-command.c:655       trace: run_command: git-daemon --verbose --base-path=. --export-all --user-path
07:31:51.921568 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
07:31:51.921777 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
[16778] Could not bind to ::: Address already in use
[16778] Ready to rumble
07:31:52.021314 run-command.c:655       trace: run_command: REMOTE_ADDR=127.0.0.1 REMOTE_PORT=65349 /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon --serve --verbose --base-path=. --export-all --user-path
Connection to localhost port 9418 [tcp/git] succeeded!
        [with] version 1
07:31:52.023856 exec-cmd.c:139          trace: resolved executable path from Darwin stack: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-daemon
07:31:52.024359 exec-cmd.c:238          trace: resolved executable dir: /Applications/Xcode.app/Contents/Developer/usr/libexec/git-core
           [it] generates the correct output[16781] Connection from 127.0.0.1:65349
fatal: the remote end hung up unexpectedly
[16778] [16781] Disconnected (with error)
 - FAIL
$ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/debug/gix --no-verbose --config protocol.version=1 remote --name git://localhost/ refs
Expected actual status 1 to be 0
Error: connection failed

Caused by:
    0: An IO error occurred when connecting to the server
    1: Connection refused (os error 61)
error: Recipe `journey-tests` failed on line 204 with exit code 1

gitoxide ( main) +11 -2 [$!] took 3s

And checking for stray processes yielded nothing:

❯ ps aux | grep git
byron            16899   0.0  0.0 410724160   1536 s002  S+    7:32AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox --exclude-dir=.venv --exclude-dir=venv git

I think what's happening here is a race condition: the previous run creates a Git daemon which actually fails as the address is already in use. However, it doesn't notice this as it backgrounds the process, and instead tries to connect which succeeds… as it connects to the previous instance of the git daemon which is still in flight but on the way to shut down.

This is why seemingly it can connect to the Git instance, but is unable to connect right after.

It seemed like an easy fix, just check if there still is someone listening and if so, wait:

    while nc -z localhost 9418; do
      echo "waiting for previous daemon to shut down"
      sleep 1
    done

However, this never kicks in as apparently it's totally possible that one cannot connect to a port, but it's also not possible to listen to the same port (yet).

The alternative was to try to bind to the port yourself to see if it works, but that would cause the same issue that this was trying to combat.

One might think that a blank sleep 1 would do it, but think again. It's just like git-daemon often cannot open a socket, maybe it's E_TRYAGAIN and it just doesn't care.

Looking at the code it is indeed clear that it doesn't care about the actual error code, so listen to sockets seems flaky at least if the socket was recently in use.

@Byron Byron merged commit ddeb97f into main Dec 16, 2024
20 checks passed
@Byron Byron deleted the remove-delegates branch December 16, 2024 21:15
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 17, 2024
This is part of investigating the problem of not being able to
connect to the git daemon. This problem is described in:
GitoxideLabs#1634 (review)

The preceding commit, bf2abdb,
which this reverts, omitted caching from the `test-journey` CI job,
and that somehow avoided the problem, causing all tests to pass.
With that change, the failure did not recur even after rerunning
the job more than 15 times:
https://github.com/EliahKagan/gitoxide/actions/runs/12363538568

That is strange, because no cause related to Rust dependencies is
apparent: The Git daemon is not itself provided by or expected to
be affected by them, and any stale dependencies from the cache
should automatically not be used.

However, perhaps the extra *delays*, incurred by not having cached
some artifacts and having to rebuild them, allowed the
failure-triggering condition to resolve on its own. If so, that
would explain this observation.

It would also be consistent with how the problem arose with the
changes in GitoxideLabs#1634, if a previous delay existed that happened to
prevent the bug from being observed and that GitoxideLabs#1634 somehow
counteracted. This could be either by a performance improvement,
or by shifting the delays to other times during the job (by
changing which artifacts were needed for which tests).

The `test-journey` job usually takes between 6:00 an 6:30 without
caching build artifacts. This is arguably not too bad, so it might
be feasible to use that change as a long-term workaround. But if
possible the underlying problem should be identified and fixed.

So this commit brings back caching to check if it reliably restores
the bug. If so, then experiments and other fixes or workarounds can
be attempted on top of that.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 17, 2024
In each CI run with failing journey tests due to not being able to
connect to `git daemon`, there are also passing tests, and in each
run that I have observed, including an early run presented in GitoxideLabs#1634
(https://github.com/GitoxideLabs/gitoxide/actions/runs/12359365805/job/34491924319?pr=1634#step:6:1049),
all passing journey tests (of those with non-skipped contents) that
connect to the git daemon do so after `nc` has reported success
connecting to `::1` (IPv6 loclahost), and the failing journey test
that attempts to connect to the git daemon does so after `nc` has
reported success connecting to `127.0.0.1` (IPv4 localhost).

That doesn't clearly elucidate what would be going wrong, and it is
possible that the problem is unrelaced, but it looks worthwhile to
check into it.

At this point, the git daemon is still not told what to listen on.
First I just want to see what `nc` reports when checking for both
to ensure that nothing is listening on either of them each time
`git-daemon` is about to be run.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 18, 2024
This is part of investigating the problem of not being able to
connect to the git daemon. This problem is described in:
GitoxideLabs#1634 (review)

The preceding commit, bf2abdb,
which this reverts, omitted caching from the `test-journey` CI job,
and that somehow avoided the problem, causing all tests to pass.
With that change, the failure did not recur even after rerunning
the job more than 15 times:
https://github.com/EliahKagan/gitoxide/actions/runs/12363538568

That is strange, because no cause related to Rust dependencies is
apparent: The Git daemon is not itself provided by or expected to
be affected by them, and any stale dependencies from the cache
should automatically not be used.

However, perhaps the extra *delays*, incurred by not having cached
some artifacts and having to rebuild them, allowed the
failure-triggering condition to resolve on its own. If so, that
would explain this observation.

It would also be consistent with how the problem arose with the
changes in GitoxideLabs#1634, if a previous delay existed that happened to
prevent the bug from being observed and that GitoxideLabs#1634 somehow
counteracted. This could be either by a performance improvement,
or by shifting the delays to other times during the job (by
changing which artifacts were needed for which tests).

The `test-journey` job usually takes between 6:00 an 6:30 without
caching build artifacts. This is arguably not too bad, so it might
be feasible to use that change as a long-term workaround. But if
possible the underlying problem should be identified and fixed.

So this commit brings back caching to check if it reliably restores
the bug. If so, then experiments and other fixes or workarounds can
be attempted on top of that.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 18, 2024
In each CI run with failing journey tests due to not being able to
connect to `git daemon`, there are also passing tests, and in each
run that I have observed, including an early run presented in GitoxideLabs#1634
(https://github.com/GitoxideLabs/gitoxide/actions/runs/12359365805/job/34491924319?pr=1634#step:6:1049),
all passing journey tests (of those with non-skipped contents) that
connect to the git daemon do so after `nc` has reported success
connecting to `::1` (IPv6 loclahost), and the failing journey test
that attempts to connect to the git daemon does so after `nc` has
reported success connecting to `127.0.0.1` (IPv4 localhost).

That doesn't clearly elucidate what would be going wrong, and it is
possible that the problem is unrelaced, but it looks worthwhile to
check into it.

At this point, the git daemon is still not told what to listen on.
First I just want to see what `nc` reports when checking for both
to ensure that nothing is listening on either of them each time
`git-daemon` is about to be run.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Dec 18, 2024
This reverts commit bf2abdb, the
supporting change in GitoxideLabs#1725 that refrained from caching build
artifacts from the `test-journey` job, since that no longer needs
to be avoided now that `git-daemon` journey tests aren't run on CI.

In 9566488 (GitoxideLabs#1634), journey tests that use `git-daemon` were
disabled on CI. GitoxideLabs#1725 tried reenabling them (65788e8) and fixing
them by no longer caching build artifacts with the `rust-cache`
action in the `test-journey` job (bf2abdb). This worked at the
time, but the exact reason it worked or how reliable it would be
were unknown. It stopped working shortly thereafter in 25b8480
(GitoxideLabs#1726), and those journey tests were re-disabled on CI in d1d3f7c,
which reverted 65788e8. However, bf2abdb was not reverted at that
time.

(The change in d1d3f7c from GitoxideLabs#1726, taken together with this
commit, effectively constitute a revert of PR GitoxideLabs#1725.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants